An Efficient Algorithm for Large Scale Compressive Feature Learning

نویسندگان

  • Hristo S. Paskov
  • John C. Mitchell
  • Trevor J. Hastie
چکیده

This paper focuses on large–scale unsupervised feature selection from text. We expand upon the recently proposed Compressive Feature Learning (CFL) framework, a method that uses dictionarybased compression to select a K-gram representation for a document corpus. We show that CFL is NP–Complete and provide a novel and efficient approximation algorithm based on a homotopy that transforms a convex relaxation of CFL into the original problem. Our algorithm allows CFL to scale to corpuses comprised of millions of documents because each step is linear in the corpus length and highly parallelizable. We use it to extract features from the BeerAdvocate dataset, a corpus of over 1.5 million beer reviews spanning 10 years. CFL uses two orders of magnitude fewer features than the full trigram space. It beats a standard unigram model in a number of prediction tasks and achieves nearly twice the accuracy on an author identification task.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

DPML-Risk: An Efficient Algorithm for Image Registration

Targets and objects registration and tracking in a sequence of images play an important role in various areas. One of the methods in image registration is feature-based algorithm which is accomplished in two steps. The first step includes finding features of sensed and reference images. In this step, a scale space is used to reduce the sensitivity of detected features to the scale changes. Afterw...

متن کامل

Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...

متن کامل

COMPUTATIONALLY EFFICIENT OPTIMUM DESIGN OF LARGE SCALE STEEL FRAMES

Computational cost of metaheuristic based optimum design algorithms grows excessively with structure size. This results in computational inefficiency of modern metaheuristic algorithms in tackling optimum design problems of large scale structural systems. This paper attempts to provide a computationally efficient optimization tool for optimum design of large scale steel frame structures to AISC...

متن کامل

A Trust Region Algorithm for Solving Nonlinear Equations (RESEARCH NOTE)

This paper presents a practical and efficient method to solve large-scale nonlinear equations. The global convergence of this new trust region algorithm is verified. The algorithm is then used to solve the nonlinear equations arising in an Expanded Lagrangian Function (ELF). Numerical results for the implementation of some large-scale problems indicate that the algorithm is efficient for these ...

متن کامل

A hybrid model based on machine learning and genetic algorithm for detecting fraud in financial statements

Financial statement fraud has increasingly become a serious problem for business, government, and investors. In fact, this threatens the reliability of capital markets, corporate heads, and even the audit profession. Auditors in particular face their apparent inability to detect large-scale fraud, and there are various ways to identify this problem. In order to identify this problem, the majori...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014